Abstract: In this paper, previously reported work is extended to the
fusion of binary valued features. In general, when mining discrete data
to train supervised discrete Bayesian classifiers, it is often of
interest to determine the best threshold setting for maximizing
performance. In this work, we utilize a discrete Bayesian
classification model, together with a gain function, to determine the
best threshold setting for a given number of binary valued training
data under each class. Results are demonstrated for simulated data by
plotting the expected gain versus threshold setting for different
numbers of training data. In general, it is shown that the expected
gain reaches a maximum at a certain threshold. Further, this maximum
point varies with the overall quantization of the data. Additional
results are also shown for a different gain function on the decision
variable, which extends previously reported results.

Keywords: Gain function, Noninformative prior, Discrete binary data,
Unknown data distribution.

1  Introduction 

In [11], results appeared that determined the best threshold setting
for maximizing classification performance, for the problem of mining
discrete data to train supervised discrete Bayesian classifiers. This
paper extends that work by reporting results for fusing binary valued
features, and for an additional gain function on the decision variable.
Before elaborating on these new results, background information is
provided about the methods used here (this also appeared in [11]).

1.1  Background  on  the  Methods  Used 

A problem that has been well studied involves classification when the
statistics (i.e., probabilistic models) of each class are unknown and
are determined empirically from training data (i.e., supervised
learning); some examples are found in [4, 3, 5, 6, 9, 13, 14, 15, 12].
For example, in [12] this problem was studied by showing the
performance of a Bayesian classification test (referred to as the
Combined Bayes Test (CBT)), which combines the information in discrete
training and test data to infer symbol probabilities.

As previously explained in Ref. [12], by “discrete” it is meant that
the data used to represent each class can take on one of M possible
values. This discrete data may have arisen naturally in its M-level
form, or it may have been derived by quantizing “fused” feature
vectors.1 In either case, in the situation of interest there are
certain labeled realizations of this (M-valued) data, and these are
referred to as the “training” data under both classes. That is, in the
two class case there are $N_{\text{class A}}$ realizations under class A
and $N_{\text{class B}}$ realizations under class B. Also, given this
training data, it is assumed that $N_y$ unlabeled “test” data are
observed, and these are to be simultaneously tested by a classifier.
Therefore, the typical classification problem utilizing the CBT
involves determining, with minimum probability of error, from which
class the unknown test data have been generated.

The interesting aspect of the CBT is its discrete observation model.
In particular, the CBT was developed in [12] using the multinomial
distribution for all independent discrete observations of training and
test data, and the Dirichlet distribution as a noninformative prior
(i.e., representing complete ignorance) on the M symbol probabilities.
Basically, this implies that the symbol probabilities are themselves
assumed to be uniformly distributed over the positive unit hyperplane.

A formula for the average probability of error, P(e), was also
developed in [12] for the CBT, and it is typically used to illustrate
its performance. In particular, based on this formula and given a fixed
number of training and test data, P(e) was shown to reach a minimum at
a particular number of discrete symbols called M*. Thus, the quantity
M* in the CBT is useful in that it represents the number of discrete
symbols, or the joint quantization fineness of feature vectors,
associated with best classification performance.2

1 For example, three binary valued features can take on M = 8 discrete
symbols: (0,0,0), (0,0,1), ..., (1,1,1), and by the same convention
four binary valued features can take on M = 16 discrete symbols. The
point is that, in the data model used here, the overall quantization
complexity, M, can be considered to be a collection of fused features
with the equivalent joint cardinality.
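The fusion of binary valued features into one of M discrete symbols,
as described in footnote 1, can be sketched as follows. This is only an
illustration: the function names and the choice of treating the first
feature as the most significant bit are assumptions made here, not
conventions taken from [12].

```python
from typing import Sequence


def fuse_binary_features(bits: Sequence[int]) -> int:
    """Map a vector of binary features to a symbol index in 0 .. 2**n - 1.

    Assumption (illustrative only): the first feature is the most
    significant bit, so (0,0,0) -> 0, (0,0,1) -> 1, ..., (1,1,1) -> 7.
    """
    symbol = 0
    for b in bits:
        if b not in (0, 1):
            raise ValueError("features must be binary (0 or 1)")
        symbol = 2 * symbol + b
    return symbol


def num_symbols(n_features: int) -> int:
    """Quantization complexity M for n fused binary valued features."""
    return 2 ** n_features
```

With three features this gives M = 8 symbols and with four features
M = 16, matching the counts in footnote 1.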

1.2  New  problem  investigated 

For the problem investigated here, consider that there are N total
training data, each composed of a typical vector of fused features and
an associated scalar s that is independent of the classification
events. Also, it will be assumed that, taken in aggregate, s is
uniformly distributed between 0 and 1. In this problem we intend to
pick a threshold $\tau$ ($0 < \tau < 1$) such that class A is composed
of all data having $s < \tau$ and class B of all data having
$s > \tau$. It is straightforward to see that the number of training
data under class A is $N_{\text{class A}} = \tau N$, and under class B
it is $N_{\text{class B}} = (1 - \tau) N$ (or at least the nearest
integers thereto).
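As a small illustration of the split just described, the class sizes
implied by a threshold can be computed as below. The function name is
hypothetical, and assigning the remainder to class B (so that the two
sizes always sum to N) is an assumption of this sketch; the text only
says "the nearest integers thereto".

```python
from typing import Tuple


def class_sizes(n_total: int, tau: float) -> Tuple[int, int]:
    """Return (N_class_A, N_class_B) for a threshold 0 < tau < 1.

    Class A receives round(tau * N) training data; class B receives the
    remainder (an assumption made so the sizes always sum to N).
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    n_a = round(tau * n_total)
    return n_a, n_total - n_a
```

For example, N = 200 and a threshold of 0.75 give 150 training data
under class A and 50 under class B.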

In modeling this problem a “gain” function, g(s), is also assigned, and
it is assumed that the classes and fused feature data will be
represented by a trained CBT. Given g(s), the expected gain obtained
from “investing” in a test datum that is adjudged to be in class B
(i.e., by the CBT) can be determined. Therefore, the problem addressed
in this paper is that of estimating the best threshold, $\tau$, which
yields the highest overall expected gain. This will be determined as a
function of the number of data, and of the quantization complexity M
representing the fused features. Further, the effect of different gain
functions will also be investigated, and in this paper a new rational
gain function is shown that was not investigated in [11].

2  Mathematical model for the new problem investigated

As stated above, the goal is to estimate the threshold $\tau$ that
yields the highest overall expected gain when a test datum is adjudged
to be in class B. To do this, the expected gain is defined as

$$J(\tau) = \tau \, p_{fa}(N_{\text{class A}}, N_{\text{class B}}) \, E(g(s) \mid s < \tau) + (1 - \tau) \, p_d(N_{\text{class A}}, N_{\text{class B}}) \, E(g(s) \mid s > \tau) \qquad (1)$$

where

2 Many of the results shown in Ref. [12] were an extension of work
given by Hughes, which is known in the literature as the Hughes
phenomenon (for example, see [3]). In extending Hughes’ result, the
performance of the CBT was compared to an uncombined maximum likelihood
(ML) based test. In particular, it was shown that larger numbers of
test data cause M* to increase for the CBT, with an overall reduction
in its average probability of error. However, for the uncombined test,
larger numbers of test data caused M* to either remain unchanged or
decrease, and its overall average probability of error increased. With
these results, it was also shown that with a slight modification the
CBT can be used to test the statistical similarity of two discrete data
sets (i.e., whether they were produced by the same multinomial
distribution).


$\tau$ and $(1 - \tau)$ represent prior probabilities;

$p_{fa}(N_{\text{class A}}, N_{\text{class B}})$ is the probability of
deciding class B in a CBT with training data of sizes
$N_{\text{class A}}$ and $N_{\text{class B}}$, when in fact the test
sample is truly from class A;

$p_d(N_{\text{class A}}, N_{\text{class B}})$ is the probability of
deciding class B in a CBT with training data of sizes
$N_{\text{class A}}$ and $N_{\text{class B}}$, when in fact the test
sample is truly from class B.

From Formula (5) in Appendix A, note that

$$p_{fa}(N_{\text{class A}}, N_{\text{class B}}) = P(H_{\text{class A}}) \, P(z_{\text{class A}} \le \tau z_{\text{class B}} \mid H_{\text{class A}}),$$

and

$$p_d(N_{\text{class A}}, N_{\text{class B}}) = 1 - P(H_{\text{class B}}) \, P(z_{\text{class A}} > \tau z_{\text{class B}} \mid H_{\text{class B}}).$$

The expected value of the gain function for $s < \tau$ is given by
$E(g(s) \mid s < \tau) = \int_0^{\tau} \frac{1}{\tau} g(s) \, ds$, and
for $s > \tau$ it is computed as
$E(g(s) \mid s > \tau) = \int_{\tau}^{1} \frac{1}{1 - \tau} g(s) \, ds$.
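To make the structure of Formula (1) concrete, the following minimal
sketch evaluates the expected gain for an arbitrary gain function. It
is not the procedure used to produce the figures: here the two CBT
probabilities are simply passed in as numbers by the caller, and the
conditional expectations are approximated with a midpoint rule. All
function and parameter names are assumptions of this sketch.

```python
from typing import Callable


def conditional_mean(g: Callable[[float], float], lo: float, hi: float,
                     n_steps: int = 10_000) -> float:
    """Midpoint-rule estimate of (1 / (hi - lo)) * integral of g on [lo, hi]."""
    width = (hi - lo) / n_steps
    total = sum(g(lo + (j + 0.5) * width) for j in range(n_steps))
    return total / n_steps


def expected_gain(tau: float, p_fa: float, p_d: float,
                  g: Callable[[float], float]) -> float:
    """Evaluate the two-term expected gain for a threshold 0 < tau < 1.

    p_fa and p_d are supplied by the caller (hypothetical inputs here);
    in the paper they come from the trained CBT.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    below = conditional_mean(g, 0.0, tau)   # E(g(s) | s < tau)
    above = conditional_mean(g, tau, 1.0)   # E(g(s) | s > tau)
    return tau * p_fa * below + (1.0 - tau) * p_d * above
```

As a sanity check, setting both probabilities to one with a linear gain
recovers the unconditional mean of s, which is 0.5 for any threshold.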

In this work, two gain functions, g(s), will be utilized in Formula (1)
to obtain results. In particular, the two functions have the
respective forms $g(s) = s^c$ and $g(s) = \frac{cs}{cs+1}$.3 Note that
Figure (1) illustrates plots of the three gain functions appearing in
the results below; for comparing performance, each function purposely
increases at a different rate as $\tau$ is increased (i.e., into the
region more favoring class B).

As can be seen in Formula (1) above, two integrals (i.e., for both
$s < \tau$ and $s > \tau$) must be evaluated for each respective gain
function. Table 1 below shows the analytical results of these
integrals.

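Table 1: Analytical expressions of the expected value of various gain
functions for $s < \tau$,
$E(g(s) \mid s < \tau) = \int_0^{\tau} \frac{1}{\tau} g(s) \, ds$, and
for $s > \tau$,
$E(g(s) \mid s > \tau) = \int_{\tau}^{1} \frac{1}{1 - \tau} g(s) \, ds$.

$$
\begin{array}{lll}
g(s) & E(g(s) \mid s < \tau) & E(g(s) \mid s > \tau) \\
\hline
s^c & \dfrac{\tau^c}{c+1} & \dfrac{1}{1-\tau}\left(\dfrac{1}{c+1} - \dfrac{\tau^{c+1}}{c+1}\right) \\
\dfrac{cs}{cs+1} & 1 - \dfrac{\ln(1+c\tau)}{c\tau} & \dfrac{1}{1-\tau}\left(1 - \tau - \dfrac{\ln(1+c)}{c} + \dfrac{\ln(1+c\tau)}{c}\right)
\end{array}
$$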
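The closed-form conditional expectations of Table 1 can be written as
small functions, as sketched below. The rational form
g(s) = cs / (cs + 1) is taken here as an assumption about the second
gain function, and its two expressions follow from integrating that
form; the function names are illustrative only.

```python
import math


def mean_power_below(tau: float, c: float) -> float:
    """E(s**c | s < tau) for s uniform on (0, 1)."""
    return tau ** c / (c + 1.0)


def mean_power_above(tau: float, c: float) -> float:
    """E(s**c | s > tau) for s uniform on (0, 1)."""
    return (1.0 - tau ** (c + 1.0)) / ((c + 1.0) * (1.0 - tau))


def mean_rational_below(tau: float, c: float) -> float:
    """E(c*s / (c*s + 1) | s < tau); assumes the rational form above."""
    return 1.0 - math.log1p(c * tau) / (c * tau)


def mean_rational_above(tau: float, c: float) -> float:
    """E(c*s / (c*s + 1) | s > tau); assumes the rational form above."""
    return 1.0 - (math.log1p(c) - math.log1p(c * tau)) / (c * (1.0 - tau))
```

One way to check such expressions is the law of total expectation:
weighting the "below" value by the threshold and the "above" value by
one minus the threshold should give the unconditional mean, which is
1 / (c + 1) for the power gain and 1 - ln(1 + c) / c for the rational
gain.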

3  Results 

Figure (2) illustrates a plot of the average expected gain $J(\tau)$ of
Formula (1), using the quadratic gain function $g(s) = s^2$ (see
Figure (1), and c = 2 in Table 1), versus the decision threshold
setting $\tau$. In this case, four curves are shown for various
quantization complexities M of, respectively (top to bottom in the
figure), 2, 8, 32, and 124 discrete symbols. This corresponds to,
respectively, 1, 3, 5, and 7 binary valued fused features. Also, each
class contains 100 samples of training data. Recall that the objective
was to determine the effect that the threshold $\tau$ has on the
expected gain function of Formula (1), which can then be used to
predict or estimate the $\tau$ yielding best performance (i.e., when a
test datum is adjudged to be in class B). Clearly, the expected gain
reaches a maximum value in each curve (for M = 2, the maximum of
$J(\tau)$ is at the maximum $\tau$), and this maximum depends on both
$\tau$ and M. Specifically, it can be seen in Figure (2) that as M is
increased from 2 discrete symbols to 124, the threshold for highest
expected gain reduces from 1 to 0.75. Further, the overall absolute
value of the expected gain reduces as well.

3 The variable c in both formulas is a constant.

Figure 1: Illustrating a plot of the gain functions used in generating
results for this paper. Shown is $g(s) = s^c$, for c = 1 and c = 2, and
$g(s) = \frac{cs}{cs+1}$ for c = 1. Note, in each case for plotting
$s = \tau$.

In general, notice that for the general gain function, $g(s) = s^c$,
used here, the average expected gain tends to increase with $\tau$
(i.e., higher gains are associated with larger threshold settings).
With that, in the CBT, and for a fixed number of training data, as M is
increased more uncertainty occurs in the model due to the curse of
dimensionality. Thus, intuitively it is not surprising that the best
overall expected gain is associated with a higher decision threshold.
Further, because the curse of dimensionality predominates with larger
values of M (i.e., more uncertainty in CBT cell probability estimates),
it is also not surprising that the overall absolute gain decreases with
M.

As a supplemental note for the results shown, all figures of this paper
were obtained using Monte Carlo simulations. Specifically, the results
are based on an average over 50 sets of true symbol probabilities
generated for each class (uniformly distributed) and, for each of
these, 100 independent trials of generating training data.
Additionally, because Monte Carlo simulations were used as opposed to
the complete analytical solution required in Formula (1), the results
in each figure tend to have a jagged appearance.

Figure 2: Illustrating a plot of the average expected gain $J(\tau)$ of
Formula (1), using the quadratic gain function $g(s) = s^2$ (see
Figure (1), and c = 2 in Table 1), versus the decision threshold
setting $\tau$. In this case, four curves are shown for various
quantization complexities M of respectively 2, 8, 32, and 124 discrete
symbols.
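The sampling part of the Monte Carlo procedure described above (sets of
true symbol probabilities, each followed by independent trials of
training data) might be sketched as follows. This is not the authors'
simulation code: the function names, the seed handling, and the use of
normalized exponential variates to draw uniformly from the probability
simplex are choices made for this illustration.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def sample_uniform_simplex(m: int, rng: random.Random) -> List[float]:
    """Draw M symbol probabilities from the uniform Dirichlet.

    Normalizing i.i.d. Exponential(1) variates gives a Dirichlet(1,...,1) draw.
    """
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_counts(probs: Sequence[float], n: int, rng: random.Random) -> List[int]:
    """Draw n discrete symbols and return the per-symbol occurrence counts."""
    counts = [0] * len(probs)
    for symbol in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[symbol] += 1
    return counts


def simulate_training_sets(m: int, n_a: int = 100, n_b: int = 100,
                           n_prob_sets: int = 50, n_trials: int = 100,
                           seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (class A counts, class B counts) for every trial of every set."""
    rng = random.Random(seed)
    for _ in range(n_prob_sets):
        p_a = sample_uniform_simplex(m, rng)   # true probabilities, class A
        p_b = sample_uniform_simplex(m, rng)   # true probabilities, class B
        for _ in range(n_trials):
            yield sample_counts(p_a, n_a, rng), sample_counts(p_b, n_b, rng)
```

Each yielded pair of count vectors would then be passed to a classifier
and the resulting gains averaged; with the default arguments the
generator produces 50 x 100 = 5000 training sets.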


In Figure (3), the situation of Figure (2) is repeated using instead
the linear gain function $g(s) = s$ (see Figure (1), and c = 1 in
Table 1). In this case, it can be seen that the overall results are
very similar to those shown in Figure (2). However, by comparing
Figures (2) and (3) it can now be seen that the absolute values of the
gains are larger using a linear gain function. For example, when $\tau$
is near zero, $J(\tau) = 0.33$ in Figure (2) and $J(\tau) = 0.5$ in
Figure (3). This implies that when operating at a best threshold
setting $\tau$, and for a given M, a linear gain function will yield
the best overall average expected gain. Notice, although it is not
shown here, that if the gain function is also scaled by a constant
(e.g., in the linear case $g(s) = c \cdot s$), then for all c > 1 the
expected gain curves increase beyond those shown in Figure (3).
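The small-threshold values and the effect of scaling can both be read
off Formula (1). The following worked steps are a sketch under one
assumption that is not stated in the text, namely that the first term
of Formula (1) is negligible and $p_d$ is close to one when $\tau$ is
near zero; the scale factor is written $a$ here only to avoid clashing
with the exponent $c$.

```latex
\begin{align*}
J(\tau) &\approx p_d \, E\bigl(g(s)\bigr)
  && \text{for } \tau \text{ near } 0, \\
E\bigl(s^c\bigr) &= \int_0^1 s^c \, ds = \frac{1}{c+1}
  && \Rightarrow \; E(s) = \tfrac{1}{2}, \quad E\bigl(s^2\bigr) = \tfrac{1}{3}, \\
g(s) \mapsto a \, g(s) &\;\Rightarrow\; J(\tau) \mapsto a \, J(\tau)
  && \text{by linearity of the expectation.}
\end{align*}
```

Under that assumption the linear gain starts near 1/2 and the quadratic
gain near 1/3, and any scale factor greater than one lifts the whole
curve proportionally.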

In Figure (4), the situation of Figure (2) is repeated, using the
quadratic gain function $g(s) = s^2$, but with ten samples of training
data for each class. In this situation, the results have the same
overall trend as in Figure (2); however, less training data has made
the curves much more jagged. Further, note that the overall average
gains are lower, and the thresholds associated with peak gain are also
lower (i.e., maximum gains are shifted to the left). Further,
performance results for higher values of M are more similar (e.g.,
compare M = 32 to M = 124). All of these trends, of course, are due to
an increase in the uncertainty in the CBT model that results when very
little training data is used to estimate the cell probabilities for
each class.

In Figure (5), the situation of Figure (3) is repeated, using the
linear gain function $g(s) = s$, and again with ten samples of training
data for each class. In this situation, the results have the same
overall trend as in comparing Figure (4) to Figure (2) for the
quadratic case above.

Figure 3: The situation of Figure (2) is repeated using instead the
linear gain function $g(s) = s$ (see Figure (1), and c = 1 in Table 1).

In Figure (6), the rational gain function $g(s) = \frac{cs}{cs+1}$ (see
Figure (1), and c = 1 in Table 1) is used to obtain results, with 100
samples of training data for each class. In this case, the rational
gain function is utilized to help illustrate the importance of gain
function shape on overall results. Notice in Figure (1) that both the
quadratic and linear gain functions increase with increasing threshold,
$\tau$, and at a steeper rate than does the rational gain function. The
impact of this on performance can be seen in Figure (6), where the
overall expected gain decreases to a minimum point with $\tau$ before
it finally increases again. With that, another interesting trend is
that for larger values of $\tau$ maximum overall gains are now
associated with larger M values (as compared to the opposite trend for
either the quadratic or linear gain functions). Recall that in this
case we are determining the expected gain obtained from “investing” in
a test datum that is adjudged to be in class B (i.e., by the CBT), and
that the rational gain function tapers off with increasing $\tau$. This
results in a decreasing expected overall gain until $\tau$ is
relatively high (i.e., class B has $s > \tau$), and the likelihood of
data under class B is also very high.

In Figure (7), the situation of Figure (6) is repeated, using the
rational gain function $g(s) = \frac{cs}{cs+1}$, but with ten samples
of training data for each class. In this situation, the results have
the same overall trend as in Figure (6). However, it is also apparent
that the minimum expected gain now occurs for smaller values of $\tau$,
and performance for larger values of M is more similar. As in Figures
(2) through (5), this is due to the small number of training data used
for estimating cell probabilities.

Figure 4: The situation of Figure (2) is repeated, using the quadratic
gain function $g(s) = s^2$, but with 10 samples of training data for
each class.

4  Summary 

In this paper, results were demonstrated in training supervised
discrete Bayesian classifiers, where it was of interest to determine
the best threshold setting for maximizing expected gain in deciding on
the class of unknown test data. In this case, the CBT, and various gain
functions, were utilized to determine the best threshold setting for a
given number of training data under each class. Results were
demonstrated for simulated data by plotting the expected gain versus
threshold setting for different overall quantization levels, and for
different numbers of discrete training data. In general, it was shown
that the expected gain reaches a maximum at certain thresholds, which
depend on the overall quantization of the data. Additionally, results
were also shown for different gain functions on the decision variable.
In this case, it turned out that a linear gain function produced better
results than a quadratic function. Further, when using a rational gain
function the expected gain actually reached a minimum point before
reaching a maximum. The interesting result in this is that the rate of
increase in the gain function, with increasing decision threshold, has
a large impact on the overall expected gain in correctly classifying
test data.

A  The Combined Bayes Test (CBT) and Its Implementation

The CBT is repeated here as it appeared in [12].

Figure 5: The situation of Figure (3) is repeated, using the linear
gain function $g(s) = s$, but with 10 samples of training data for
each class.

A.1  Combined Information Classification

A.1.1  Combined Multinomial Model

With this model, it is assumed that there exists a pair of probability
vectors, $\mathbf{p}_k$ and $\mathbf{p}_l$, the ith elements of which
denote the probability of a symbol of type i being observed under the
respective classes k and l. The fundamental model for this testing
method is thus formulated based on the number of occurrences of each
discrete symbol being an i.i.d. multinomially distributed random
variable. Therefore, the joint distribution for the frequency of
occurrence of all training and test data, with the test data,
$\mathbf{y}$, a member of class k, is given by (boldface indicates a
vector quantity)


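$$f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid \mathbf{p}_k, \mathbf{p}_l, H_k) = N_k! \, N_l! \, N_y! \prod_{i=1}^{M} \frac{p_{k,i}^{\,x_{k,i} + y_i} \; p_{l,i}^{\,x_{l,i}}}{x_{k,i}! \; x_{l,i}! \; y_i!} \qquad (2)$$

where4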


$k, l \in \{\text{class A}, \text{class B}\}$, and $k \ne l$;

$H_k$ is the hypothesis defined as $\mathbf{p}_y = \mathbf{p}_k$;

M is the number of discrete symbols;

$x_{k,i}$ is the number of occurrences of the ith symbol in the
training data for class k;

$N_k$ $\left\{N_k = \sum_{i=1}^{M} x_{k,i}\right\}$ is the total number
of training data for class k;

$y_i$ is the number of occurrences of the ith symbol in the test data;

$N_y$ $\left\{N_y = \sum_{i=1}^{M} y_i\right\}$ is the total number of
test data;

$p_{k,i}$ $\left\{\sum_{i=1}^{M} p_{k,i} = 1\right\}$ is the
probability of the ith symbol for class k.

4 In the following notation k and l are exchangeable.


Figure 6: The situations of Figures (2) and (3) are repeated using the
rational gain function $g(s) = \frac{cs}{cs+1}$ (see Figure (1), and
c = 1 in Table 1), with 100 samples of training data for each class.


A.1.2  Combined Bayes Test (CBT)

Rather than assuming that $\mathbf{p}_k$ and $\mathbf{p}_l$ are simply
unknown parameters to be estimated (and the resulting test a CGLRT5),
our approach here is to give them prior distributions. Nothing a priori
is known about the probability vectors, and hence the appropriate prior
is one of complete ignorance: the uniform Dirichlet, which is given by


$$f(\mathbf{p}_k) = (M - 1)! \; I\left\{\sum_{i=1}^{M} p_{k,i} = 1\right\} \qquad (3)$$

where $I\{x\}$ is the indicator function.

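The CBT, which can be referred to as a Bayes factor (see [7]), appears
as

$$\frac{f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k)}{f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_l)} = \frac{(N_k + M - 1)! \, (N_l + N_y + M - 1)!}{(N_k + N_y + M - 1)! \, (N_l + M - 1)!} \prod_{i=1}^{M} \frac{(x_{k,i} + y_i)! \, (x_{l,i})!}{(x_{k,i})! \, (x_{l,i} + y_i)!} \; \underset{H_l}{\overset{H_k}{\gtrless}} \; \tau \qquad (4)$$

where the decision threshold $\tau$ is equal to $P(H_l)/P(H_k)$ for
minimizing the probability of error.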
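The CBT ratio above involves only factorials of counts, so it can be
evaluated stably in the log domain. The sketch below is one possible
way to do this using the standard identity log(n!) = lgamma(n + 1); the
function names, and the choice to resolve ties in favor of class l, are
assumptions of this illustration rather than details given in [12].

```python
import math
from typing import Sequence


def log_factorial(n: int) -> float:
    """log(n!) via the log-gamma function."""
    return math.lgamma(n + 1)


def cbt_log_ratio(x_k: Sequence[int], x_l: Sequence[int], y: Sequence[int]) -> float:
    """Natural log of the CBT ratio for symbol-count vectors of length M."""
    if not (len(x_k) == len(x_l) == len(y)):
        raise ValueError("all count vectors must have the same length M")
    m = len(x_k)
    n_k, n_l, n_y = sum(x_k), sum(x_l), sum(y)
    log_ratio = (log_factorial(n_k + m - 1) + log_factorial(n_l + n_y + m - 1)
                 - log_factorial(n_k + n_y + m - 1) - log_factorial(n_l + m - 1))
    for xk, xl, yi in zip(x_k, x_l, y):
        log_ratio += (log_factorial(xk + yi) + log_factorial(xl)
                      - log_factorial(xk) - log_factorial(xl + yi))
    return log_ratio


def cbt_decide(x_k: Sequence[int], x_l: Sequence[int], y: Sequence[int],
               tau: float = 1.0) -> str:
    """Return "k" if the ratio exceeds tau, otherwise "l" (ties go to l)."""
    return "k" if cbt_log_ratio(x_k, x_l, y) > math.log(tau) else "l"
```

For example, with M = 2, training counts (3, 1) for class k and (1, 3)
for class l, and a single test datum of the first symbol, the ratio
evaluates to 2, so the test datum is assigned to class k when the
threshold is 1.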

5 The combined GLRT, or CGLRT ([12]; also see [13]), represents the
correctly-posed generalized likelihood ratio procedure which relies on
ML probability estimates culled from both training and test data.
Notice that although the CGLRT is appealing from a practical
perspective, from a theoretical standpoint it is less interesting due
to its lack of optimality in non-asymptotic situations. With this, our
preference for a Bayesian approach to this problem has been
substantiated by other more recent results. Specifically, the
probabilistic structure of the CBT was used in [10] with simulated and
real data to reduce the number of symbols (M) for improved
classification performance in a way far superior to that of GLRT based
methods.

Figure 7: The situation of Figure (6) is repeated, using the rational
gain function $g(s) = \frac{cs}{cs+1}$, but with 10 samples of training
data for each class.

Note, the CBT can be determined (after correct substitution of model
parameters, and a slight reworking of the result) from the
Multinomial-Dirichlet distribution shown in [1]. In fact, the data
reduction method known as the Bayesian Data Reduction Algorithm (BDRA),
developed in [10], is actually based on a conditional CBT equivalent to
the Multinomial-Dirichlet.

A.1.3  Probability of Error

Letting $z_k = f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k)$ (see
Formula (4) above), the average probability of error for the CBT is
defined as

$$P(e) = P(H_k) \, P(z_k \le \tau z_l \mid H_k) + P(H_l) \, P(z_k > \tau z_l \mid H_l) \qquad (5)$$

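It is necessary to show only the first term of (5), as the second term
is similar except for conditioning on $H_l$. Thus, ignoring $P(H_k)$,
the first term of (5) is given by

$$P(z_k \le \tau z_l \mid H_k) = \sum_{\mathbf{y}} \sum_{\mathbf{x}_k} \sum_{\mathbf{x}_l} I\{z_k \le \tau z_l\} \, f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k) \qquad (6)$$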

where $f(\mathbf{x}_k, \mathbf{x}_l, \mathbf{y} \mid H_k)$ was defined
in Formula (2) above.
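For very small problems the triple sum in (6) can be evaluated exactly
by enumerating every possible count vector. The sketch below does this
under an assumption that is not spelled out in this appendix: the
likelihood conditioned only on the hypothesis is taken to be the
Dirichlet-multinomial marginal obtained by integrating Formula (2) over
the uniform Dirichlet priors of (3). Function names and the tie
tolerance are likewise illustrative.

```python
import math
from itertools import product
from typing import Iterator, Sequence, Tuple


def compositions(n: int, m: int) -> Iterator[Tuple[int, ...]]:
    """Yield all length-m tuples of non-negative integers that sum to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, m - 1):
            yield (first,) + rest


def log_marginal(x_a: Sequence[int], x_b: Sequence[int], y: Sequence[int]) -> float:
    """log f(x_a, x_b, y | H_a): test counts y are pooled with class a.

    Assumption: Dirichlet-multinomial marginal under uniform Dirichlet priors.
    """
    lf = lambda n: math.lgamma(n + 1)  # log(n!)
    m = len(x_a)
    n_a, n_b, n_y = sum(x_a), sum(x_b), sum(y)
    out = lf(n_a) + lf(n_b) + lf(n_y) + 2.0 * lf(m - 1)
    out -= lf(n_a + n_y + m - 1) + lf(n_b + m - 1)
    for a, c in zip(x_a, y):
        out += lf(a + c) - lf(a) - lf(c)
    return out


def error_given_hk(m: int, n_k: int, n_l: int, n_y: int, tau: float = 1.0) -> float:
    """Exact P(z_k <= tau * z_l | H_k) by full enumeration (small cases only)."""
    total = 0.0
    for x_k, x_l, y in product(compositions(n_k, m), compositions(n_l, m),
                               compositions(n_y, m)):
        log_zk = log_marginal(x_k, x_l, y)
        log_zl = log_marginal(x_l, x_k, y)
        # Small tolerance so exact ties are not lost to rounding.
        if log_zk <= log_zl + math.log(tau) + 1e-12:
            total += math.exp(log_zk)
    return total
```

The number of terms grows combinatorially with M and the data counts,
which is consistent with the use of Monte Carlo simulation for the
figures; this enumeration is only practical for cases such as M = 2
with a handful of data.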